Author: Kamila Makhambetova

Introduction

Research Questions:

Which factors among region, GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, perceptions of corruption, and ladder score in dystopia strongly affect the ladder score (happiness score) in 2021? Is the estimated regression model between the ladder score (happiness score) and the eight factors stated above appropriate for the application? What does it look like?

Research Interest

2020 and 2021 have been among the hardest years for humankind in the 21st century. Because of Covid-19, we faced many restrictions, and it is clear that people became unhappier. That is why I decided to investigate the happiness score and the factors that affect it in 2021. I chose 50 different countries from different parts of the world and with different happiness scores to make my data wide-ranging and unbiased.

Data

Data was taken from World Happiness Report 2021: https://databank.worldbank.org . The World Happiness Report is a publication of the Sustainable Development Solutions Network, powered by data from the Gallup World Poll and Lloyd’s Register Foundation, which provided access to the World Risk Poll. The 2021 Report includes data from the ICL-YouGov Behaviour Tracker as part of the COVID Data Hub from the Institute of Global Health Innovation. The World Happiness Report was written by a group of independent experts acting in their personal capacities.
All 9 variables were taken from the World Happiness Report 2021:
1) The happiness score or subjective well-being (variable name: ladder) was taken from the Feb 26, 2021 release of the Gallup World Poll (GWP), covering the years 2005 to 2020.
2) GDP per capita was taken from the October 14, 2020 update of the World Development Indicators (WDI).
3) Healthy Life Expectancy (HLE) at birth is based on data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository (last updated: 2020-09-28).
4) Social support, freedom to make life choices, generosity, and corruption perception are the national averages of the binary responses (either 0 or 1) to the corresponding GWP questions.
5) The ladder score in dystopia was calculated by the creators of the World Happiness Report 2021.

Definitions

Happiness score (ladder score) is the national average response to a question about life evaluation. Residents are asked to assess their own lives on a scale from 0 (the worst possible life) to 10 (the best possible life).

GDP per capita (PPP based) is gross domestic product converted to international dollars using purchasing power parity rates and divided by total population.
Healthy Life Expectancy is the average number of years that a person can expect to live in “full health” by taking into account years lived in less than full health due to disease and/or injury.
Social support is the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
Freedom to make life choices is the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
Generosity is the residual of regressing the national average of response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
Corruption Perception is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses.
Ladder score in dystopia is the lowest national average for each key variable and is, along with the residual error, used as a regression benchmark.

Response variable (Y): Happiness score (ladder score) in range [0,10]
Explanatory variables (\(x_i\)): GDP per capita (PPP based), region, Healthy life expectancy, social support, freedom to make life choices, Generosity, Corruption Perception, Ladder score in dystopia.
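With these variables, the model to be estimated is a standard multiple linear regression (a sketch of the form implied by the Method section; region enters as a set of indicator variables):

```latex
Y_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_k x_{i,k} + \varepsilon_i,
\qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)
```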

Hypothesis and Explanation (about the correlation)

Hypothesis:

The happiness score and each of the explanatory variables GDP per capita (PPP based), Healthy Life Expectancy, Social support, Freedom to make life choices, and Generosity have a positive relationship: a rise in one of these factors causes a rise in the happiness score. The happiness score and Corruption Perception have an inverse relationship: a rise in corruption perception causes a fall in the happiness score. Countries in the Western Europe region have high happiness scores, while countries in Sub-Saharan Africa have low happiness scores.

Explanation of hypothesis

An increase in GDP per capita means an increase in the country’s production of goods and services. This leads to economic growth, the quality of life in the country rises, and people become happier. An increase in social support means an increase in the number of people who have relatives or friends they can count on in trouble. An increase in freedom to make life choices leads to a rise in the number of satisfied people, who can shape their own lives without restrictions imposed by government, religion, or society. An increase in generosity means more people participate in charity; perhaps they started to earn more money, so they have extra money to donate. An increase in healthy life expectancy may reflect an efficient government contribution to the health care system, i.e. the government spends a significant part of its budget on developing health care, so the quality of life in the country increases. An increase in any of these factors raises the quality of life, so the happiness score rises.
An increase in Corruption Perception means more residents think that corruption is widespread throughout the government and business. They lose trust in the government, the judicial system, and governmental structures. The quality of life decreases, which leads to a fall in the happiness score.
Most Western European countries are developed countries: their economies are strong, and they have low corruption, a high level of democracy, developed infrastructure, and a good health care system. So the quality of life there is significantly high, and countries in the Western Europe region have high happiness scores. Most Sub-Saharan African countries are developing countries: their economies are weak, and they have high corruption, a low level of democracy, weak infrastructure, and a weak health care system. So the quality of life there is significantly low, and countries in Sub-Saharan Africa have low happiness scores.

Note

It is difficult to judge the appropriateness of this regression model by just looking at the data table, without drawing proper graphs and conducting proper tests. So all conclusions about the linearity of Y, the homogeneity of variance of the residuals, the independence of the residuals, and the normality of the residuals will be made in the ‘Data Analysis’ section.

Method

1. Linear modeling: happiness score vs region, GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, perceptions of corruption, ladder score in dystopia
2. Multicollinearity: Scatterplot Matrix and correlation Matrix, VIF
3. Model Selection: F-test, Added-variable plots, Stepwise regression; model selection criteria: Adjusted \(R^2\), AIC, BIC, \(C_p\)
4. Residual analysis:
4.1) Independence of residuals: Sequence plot \(e_i\) vs \(i\) and Test for Independence
4.2) Normal distribution of residuals: QQ Plot, Shapiro-Wilk Test for Normality
4.3) Homogeneity of Variance and linearity of \(\hat{Y}_{i}\): Residuals \(e_i\) vs Fitted values \(\hat{Y}_{i}\), Levene’s Test for Homogeneity of Variance
4.4) Outliers: Studentized deleted residuals, leverages, DFFITS, Cook’s distance
4.5) Multicollinearity: VIF
4.6) Added-variable plots
5. Remedial measures and diagnostics/ residual analysis of new model.

Data Analysis

1) Linear Modeling

Table on which my project is based

knitr::opts_chunk$set(echo = TRUE)
#install.packages('plyr', repos = "http://cran.us.r-project.org")
#install.packages('XQuartz', repos = "https://www.xquartz.org/")
#install.packages("readxl")
library("readxl")
library("plyr")
#library("XQuartz")
#install.packages("xlsx")
#library(xlsx)

My_Data <- read_excel("Data2.xlsx",sheet=1,range="B2:K52", col_names = TRUE)

hapscore<-My_Data$`Ladder score`
Gdp<-My_Data$`GDP per capita`
socsup<-My_Data$`Social support`
lifeexp<-My_Data$`Healthy life expectancy`
freed<-My_Data$`Freedom to make life choices`
generos<-My_Data$`Generosity`
corrupt<-My_Data$`Perceptions of corruption`
dystopia<-My_Data$`Ladder score in Dystopia +residual`
region<-My_Data$`Regional indicator`


My_Data
## # A tibble: 50 x 10
##    Country `Regional indic… `Ladder score` `GDP per capita` `Social support`
##    <chr>   <chr>                     <dbl>            <dbl>            <dbl>
##  1 Finland Western Europe             7.84             10.8            0.954
##  2 Denmark Western Europe             7.62             10.9            0.954
##  3 Switze… Western Europe             7.57             11.1            0.942
##  4 Iceland Western Europe             7.55             10.9            0.983
##  5 Nether… Western Europe             7.46             10.9            0.942
##  6 Norway  Western Europe             7.39             11.1            0.954
##  7 Sweden  Western Europe             7.36             10.9            0.934
##  8 Luxemb… Western Europe             7.32             11.6            0.908
##  9 Austria Western Europe             7.27             10.9            0.934
## 10 New Ze… North America a…           7.28             10.6            0.948
## # … with 40 more rows, and 5 more variables: `Healthy life expectancy` <dbl>,
## #   `Freedom to make life choices` <dbl>, Generosity <dbl>, `Perceptions of
## #   corruption` <dbl>, `Ladder score in Dystopia +residual` <dbl>
my_model=lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+region, data=My_Data)
summary(my_model)
## 
## Call:
## lm(formula = hapscore ~ Gdp + socsup + lifeexp + freed + generos + 
##     corrupt + dystopia + region, data = My_Data)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.084598 -0.004443  0.000513  0.006929  0.044054 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -4.420514   0.149967 -29.477   <2e-16
## Gdp                                       0.330308   0.008476  38.968   <2e-16
## socsup                                    2.426241   0.075869  31.979   <2e-16
## lifeexp                                   0.029384   0.001367  21.496   <2e-16
## freed                                     1.304590   0.051786  25.192   <2e-16
## generos                                   0.598031   0.033647  17.774   <2e-16
## corrupt                                  -0.626058   0.022294 -28.082   <2e-16
## dystopia                                  0.981058   0.014448  67.901   <2e-16
## regionCommonwealth of Independent States -0.019576   0.014593  -1.341   0.1884
## regionEast Asia                          -0.033902   0.014643  -2.315   0.0266
## regionLatin America and Caribbean        -0.004396   0.013374  -0.329   0.7443
## regionMiddle East and North Africa        0.018483   0.014628   1.264   0.2147
## regionNorth America and ANZ               0.024950   0.017559   1.421   0.1642
## regionSub-Saharan Africa                  0.003464   0.028897   0.120   0.9053
## regionWestern Europe                      0.024608   0.013610   1.808   0.0792
##                                             
## (Intercept)                              ***
## Gdp                                      ***
## socsup                                   ***
## lifeexp                                  ***
## freed                                    ***
## generos                                  ***
## corrupt                                  ***
## dystopia                                 ***
## regionCommonwealth of Independent States    
## regionEast Asia                          *  
## regionLatin America and Caribbean           
## regionMiddle East and North Africa          
## regionNorth America and ANZ                 
## regionSub-Saharan Africa                    
## regionWestern Europe                     .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02035 on 35 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9993 
## F-statistic:  4750 on 14 and 35 DF,  p-value: < 2.2e-16

Comment:

By the t-test we reject \(H_0\): \(B_{i}=0\) for the intercept, GDP, social support, life expectancy, freedom to make life choices, generosity, perceptions of corruption, and happiness score in dystopia, as p-value < \(2\times10^{-16}\) < 0.05. For the categorical variable region, only the East Asia category is significant at \(\alpha=0.05\), as p-value = 0.0266 < 0.05, so we reject \(H_0\): \(B_{9}=0\).
By the F-test, which tests whether all \(B_{i}=0\), we can reject \(H_0\): \(B_{1}=B_{2}=\dots=B_{14}=0\), as p-value < \(2\times10^{-16}\) < \(\alpha=0.05\). So there is at least one nonzero coefficient. The F-test for the full model and the t-tests for each predictor in the presence of the other predictors come to the same conclusion that there are nonzero \(B\)’s, so I assume there is no strong multicollinearity between the predictors.
So \(\hat{Y}_{i} = -4.42+0.330X_{i,1}+2.426X_{i,2}+0.029X_{i,3}+1.305X_{i,4}+0.598X_{i,5}-0.626X_{i,6}+0.981X_{i,7}-0.034X_{i,8}\), where \(X_{i,8}=0\) for countries that are not in East Asia and \(X_{i,8}=1\) for countries in the East Asia region.

2) Multicollinearity

Scatterplot Matrix and the correlation Matrix

To analyze multicollinearity between predictor variables I decided to draw scatterplot matrix and construct the correlation matrix.

My_Data2 <- read_excel("Data2.xlsx",sheet=1,range="D2:K52", col_names = TRUE)
pairs(My_Data2 , lower.panel = NULL)

cor(My_Data2 )
##                                    Ladder score GDP per capita Social support
## Ladder score                          1.0000000     0.76130086      0.7245603
## GDP per capita                        0.7613009     1.00000000      0.6913561
## Social support                        0.7245603     0.69135607      1.0000000
## Healthy life expectancy               0.6570834     0.78040574      0.6994376
## Freedom to make life choices          0.6025866     0.35556647      0.5145396
## Generosity                            0.4252662    -0.02444444      0.2424689
## Perceptions of corruption            -0.7345505    -0.49284577     -0.4568008
## Ladder score in Dystopia +residual    0.1956136    -0.29840614     -0.3286375
##                                    Healthy life expectancy
## Ladder score                                    0.65708344
## GDP per capita                                  0.78040574
## Social support                                  0.69943759
## Healthy life expectancy                         1.00000000
## Freedom to make life choices                    0.32798082
## Generosity                                     -0.04717927
## Perceptions of corruption                      -0.42376756
## Ladder score in Dystopia +residual             -0.38503564
##                                    Freedom to make life choices  Generosity
## Ladder score                                          0.6025866  0.42526621
## GDP per capita                                        0.3555665 -0.02444444
## Social support                                        0.5145396  0.24246886
## Healthy life expectancy                               0.3279808 -0.04717927
## Freedom to make life choices                          1.0000000  0.42708377
## Generosity                                            0.4270838  1.00000000
## Perceptions of corruption                            -0.5706967 -0.35728279
## Ladder score in Dystopia +residual                   -0.1046547  0.32589787
##                                    Perceptions of corruption
## Ladder score                                     -0.73455054
## GDP per capita                                   -0.49284577
## Social support                                   -0.45680078
## Healthy life expectancy                          -0.42376756
## Freedom to make life choices                     -0.57069671
## Generosity                                       -0.35728279
## Perceptions of corruption                         1.00000000
## Ladder score in Dystopia +residual               -0.03914784
##                                    Ladder score in Dystopia +residual
## Ladder score                                               0.19561365
## GDP per capita                                            -0.29840614
## Social support                                            -0.32863749
## Healthy life expectancy                                   -0.38503564
## Freedom to make life choices                              -0.10465470
## Generosity                                                 0.32589787
## Perceptions of corruption                                 -0.03914784
## Ladder score in Dystopia +residual                         1.00000000

Comment:

From the Scatterplot Matrix we can see that Y (happiness score) has a strong linear relationship with each predictor variable except Generosity and Ladder score in Dystopia: the data are distributed in a narrow band with a linear shape, and the correlation coefficients between Y and these predictors satisfy 0.602 < |r| < 0.762 (from the correlation Matrix). There is a medium positive relationship between Y and Generosity: the points are spread in a wide band and r = 0.425 (from the correlation Matrix). The Scatterplot Matrix suggests a medium positive relationship between Y and Ladder score in Dystopia, but the correlation Matrix shows r = 0.196, which is a small correlation.
The Ladder score in Dystopia has weak correlations with the other predictor variables, 0.039 < |r| < 0.385, so it shows little collinearity with the other predictors.
GDP has a strong positive linear relationship with social support (from the Scatterplot Matrix, r = 0.691) and with healthy life expectancy (r = 0.780). So GDP is strongly collinear with social support and healthy life expectancy.
Social support has a strong positive linear relationship with healthy life expectancy (r = 0.699), so they are strongly collinear.

Note

I will not drop any predictor variable: the scatterplot shows only the marginal relationship between Y and each predictor, and gives no information about the joint relationship between Y (happiness score) and the 8 predictor variables. Despite the small correlation r = 0.196 between Y and Ladder score in Dystopia, I will not drop this predictor: a weak marginal relationship does not mean the predictor is not needed in a model that includes the other predictors. Further tests are required.

F- test


We want to test whether we can drop q variables from a model that has p = k + 1 coefficients (including the intercept), where p is the number of \(\beta\)’s and q < p.
\(H_0:\; B_{j_1}=B_{j_2}=\dots=B_{j_q}=0\) in the full model.
If \(H_0\) is true, the p-value for the test is Pr(\(F_{q,\,n-p} > F^*\)).
Use R to calculate the p-value.
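For completeness, the test statistic behind this p-value is the general linear test statistic, where SSE(R) and SSE(F) are the residual sums of squares of the reduced and full models:

```latex
F^* \;=\; \frac{\bigl(SSE(R) - SSE(F)\bigr)/q}{SSE(F)/(n-p)}
\;\sim\; F_{q,\,n-p} \quad \text{under } H_0
```

This is exactly what `anova(full, reduced)` reports below.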


I want to test whether Ladder score in Dystopia is necessary in the presence of the other predictors, since the correlation Matrix shows only a weak positive relationship between Y (happiness score) and Ladder score in Dystopia (r = 0.196).
\(H_0:\; \beta_7=0\) in the full model

NOTE: For all tests I will use the significance level \(\alpha=0.05\)

my_model=lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+region, data=My_Data)
model_reduced= update(my_model, . ~ . - dystopia)
anova(my_model, model_reduced)
## Analysis of Variance Table
## 
## Model 1: hapscore ~ Gdp + socsup + lifeexp + freed + generos + corrupt + 
##     dystopia + region
## Model 2: hapscore ~ Gdp + socsup + lifeexp + freed + generos + corrupt + 
##     region
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     35 0.0145                                  
## 2     36 1.9246 -1   -1.9101 4610.5 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comment:

Reject \(H_0\): \(\beta_7=0\), as p-value < \(2.2\times10^{-16}\) < \(\alpha=0.05\). So Ladder score in Dystopia is necessary in the presence of the other predictors.

Variance inflation factor (VIF)

The predictor \(x_j\) has:
\(VIF_j=\frac{1}{1-R^{2}_j}\)
where \(R^{2}_j\) is the \(R^2\) from regressing \(x_j\) on the remaining predictors.
If \(VIF_j \approx 1\) then \(x_j\) is not involved in any multicollinearity.
If \(VIF_j >10\) then \(x_j\) is involved in severe multicollinearity.
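As a minimal, self-contained sketch of this definition (synthetic data, not the happiness dataset): regress one predictor on the rest, take that \(R^2_j\), and invert.

```r
# Illustrative only: two correlated synthetic predictors, not the happiness data.
set.seed(42)
x2 <- rnorm(100)
x1 <- 0.8 * x2 + rnorm(100, sd = 0.5)  # x1 is correlated with x2

# VIF_1 = 1 / (1 - R^2 from regressing x1 on the remaining predictor(s))
r2_aux <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_aux)
vif_x1  # noticeably above 1 because x1 and x2 are correlated
```

With the project’s variables, the same computation for GDP would regress Gdp on the other numeric predictors; `car::vif()` automates this for the whole model, as shown below.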

library(car)
## Loading required package: carData
vif(my_model)
##               GVIF Df GVIF^(1/(2*Df))
## Gdp       7.143813  1        2.672791
## socsup    4.274933  1        2.067591
## lifeexp   5.971189  1        2.443602
## freed     2.479643  1        1.574688
## generos   2.233515  1        1.494495
## corrupt   2.687152  1        1.639253
## dystopia  2.855025  1        1.689682
## region   78.208205  7        1.365313

Comment:

From the computed variance inflation factors we can see that the VIFs for all predictor variables except region are less than 10. So GDP, social support, life expectancy, freedom to make life choices, generosity, perceptions of corruption, and happiness score in dystopia are not involved in severe multicollinearity. For region, which is a factor with 7 degrees of freedom, R reports a generalized VIF (GVIF); its GVIF = 78.2 > 10, so by this criterion region is involved in severe multicollinearity. (The Df-adjusted value \(GVIF^{1/(2\cdot Df)} = 1.37\) is modest, however, so this evidence should be read with some caution.)

Note

I decided to drop the categorical variable region, as by the t-test only the East Asia category is significant. If a country belongs to East Asia and the other predictors are held constant, the happiness score decreases by 0.0339 units; if a country does not belong to East Asia and the other predictors are held constant, the happiness score stays unchanged. In my opinion, a decrease of 0.0339 units is a very small change. Also, region is involved in severe multicollinearity. So I can simply drop it.

3) Model selection

Added variable plots

Added-variable plots are refined plots that help determine whether a “non-linear” pattern remains once the other variables are in the model.
They give an idea of the functional form of \(x_j\): a transformation of \(x_j\) should mimic the pattern seen in the plot.

#par(mfrow=c(4,2))

#Added var plot for x1
fit11 = lm(hapscore~ socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx1=fit11$residuals

fit12 = lm(Gdp ~ socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex1onwithoutx1=fit12$residuals
plot(reYonithoutx1 ~ rex1onwithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)", xlab="e(x1|others)", ylab="e(Y|others without x1) ")

#scatter.smooth(x=rex1onwithoutx1, y=reYonithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)",xlab="e(x1|others)", ylab="e(Y|others without x1)") 


#Added var plot for x2
fit21 = lm(hapscore~ Gdp+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx2=fit21$residuals

fit22 = lm(socsup~ Gdp+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex2onwithoutx2=fit22$residuals
plot(reYonithoutx2 ~ rex2onwithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2) ")

#Added var plot for x3
fit31 = lm(hapscore~ Gdp+socsup+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx3=fit31$residuals

fit32 = lm(lifeexp~ Gdp+socsup+freed+generos+corrupt+dystopia, data=My_Data)
rex3onwithoutx3=fit32$residuals
plot(reYonithoutx3 ~ rex3onwithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3) ")

#Added var plot for x4
fit41 = lm(hapscore~ Gdp+socsup+lifeexp+generos+corrupt+dystopia, data=My_Data)
reYonithoutx4=fit41$residuals

fit42 = lm(freed~ Gdp+socsup+lifeexp+generos+corrupt+dystopia, data=My_Data)
rex4onwithoutx4=fit42$residuals
plot(reYonithoutx4 ~ rex4onwithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choice)", xlab="e(x4|others)", ylab="e(Y|others without x4) ")

#Added var plot for x5
fit51 = lm(hapscore~ Gdp+socsup+lifeexp+freed+corrupt+dystopia, data=My_Data)
reYonithoutx5=fit51$residuals

fit52 = lm(generos~ Gdp+socsup+lifeexp+freed+corrupt+dystopia, data=My_Data)
rex5onwithoutx5=fit52$residuals
plot(reYonithoutx5 ~ rex5onwithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5) ")

#Added var plot for x6
fit61 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+dystopia, data=My_Data)
reYonithoutx6=fit61$residuals

fit62 = lm(corrupt~ Gdp+socsup+lifeexp+freed+generos+dystopia, data=My_Data)
rex6onwithoutx6=fit62$residuals
plot(reYonithoutx6 ~ rex6onwithoutx6, main="Graph 6: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6) ")

#Added var plot for x7
fit71 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt, data=My_Data)
reYonithoutx7=fit71$residuals

fit72 = lm(dystopia~ Gdp+socsup+lifeexp+freed+generos+corrupt, data=My_Data)
rex7onwithoutx7=fit72$residuals
plot(reYonithoutx7 ~ rex7onwithoutx7, main="Graph 7: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7) ")

Comment:

From all 7 added-variable plots we can see that there is a strong linear relationship between residuals(\(Y\)|\(x_{-j}\)) and residuals(\(x_{j}\)|\(x_{-j}\)). So every \(x_j\) explains residual variability once the rest of the predictors are in the model. Also, since the relationship is linear, I do not need to transform any \(x_j\), and I do not think I need to add polynomial terms.
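A useful numerical check on these plots is the standard identity that the least-squares slope in the added-variable plot for \(x_j\) equals the coefficient of \(x_j\) in the full multiple regression. A sketch with synthetic data (not the project data):

```r
# Synthetic illustration of the added-variable-plot slope identity.
set.seed(7)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 2 * x1 + 3 * x2 + rnorm(50)

e_y <- resid(lm(y ~ x2))    # residuals of Y regressed on the other predictor
e_x <- resid(lm(x1 ~ x2))   # residuals of x1 regressed on the other predictor

av_slope  <- coef(lm(e_y ~ e_x))[["e_x"]]
full_coef <- coef(lm(y ~ x1 + x2))[["x1"]]
all.equal(av_slope, full_coef)  # TRUE: the AV-plot slope is the partial coefficient
```

The same check could be run on any of the seven pairs of residual vectors computed above.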

Automated variable search: Stepwise regression

Stepwise regression combines 2 methods: forward selection, where we add variables to the model, and backward elimination, where we remove predictors from the model.
I added all possible predictors to the model and chose an entry p-value of \(\alpha_e=0.1\) and a removal p-value of \(\alpha_s=0.15\) (the pent and prem arguments below).
There are 21 possible interaction terms, since 7C2 = 21, plus 7 second-order terms.
In total my model has 35 predictors.
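The count of candidate terms can be verified directly (7 first-order terms, 7C2 pairwise interactions, 7 quadratics):

```r
# 7 first-order + choose(7, 2) interaction + 7 squared terms
7 + choose(7, 2) + 7  # → 35
```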

library(olsrr)
## 
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
## 
##     rivers
My_Data <- read_excel("Data2.xlsx",sheet=1,range="B2:K52", col_names = TRUE)
My_Data$GDP2=My_Data$`GDP per capita`^2
My_Data$socsup2=My_Data$`Social support`^2
My_Data$lifeexp2=My_Data$`Healthy life expectancy`^2
My_Data$freed2=My_Data$`Freedom to make life choices`^2
My_Data$generos2=My_Data$Generosity^2
My_Data$corrupt2=My_Data$`Perceptions of corruption`^2
My_Data$dystopia2=My_Data$`Ladder score in Dystopia +residual`^2

m2 = update(my_model, . ~ . + GDP2 + socsup2+lifeexp2+freed2+generos2+corrupt2+dystopia2)
m3 = update(m2,. ~ . + Gdp*socsup+Gdp*lifeexp+Gdp*freed+Gdp*generos+Gdp*corrupt+Gdp*dystopia)
m4 = update(m3,. ~ . + socsup*lifeexp+socsup*freed+socsup*generos+socsup*corrupt+socsup*dystopia)
m5 = update(m4,. ~ . + lifeexp*freed+lifeexp*generos+lifeexp*corrupt+lifeexp*dystopia)
m6 = update(m5,. ~ . + freed*generos+freed*corrupt+freed*dystopia)
m7 = update(m6,. ~ . + generos*corrupt+generos*dystopia+corrupt*dystopia)
ols_step_both_p(m7, pent = 0.1, prem = 0.15)
## 
##                                     Stepwise Selection Summary                                     
## --------------------------------------------------------------------------------------------------
##                              Added/                   Adj.                                            
## Step        Variable        Removed     R-Square    R-Square       C(p)           AIC        RMSE     
## --------------------------------------------------------------------------------------------------
##    1       Gdp:freed        addition       0.686       0.680     89704.7240      60.1247    0.4243    
##    2        dystopia        addition       0.828       0.821     49218.9360      32.1315    0.3177    
##    3        generos2        addition       0.829       0.818     48943.1970      33.8488    0.3202    
##    4        dystopia        removal        0.706       0.694     84077.5450      58.8860    0.4151    
##    5       dystopia2        addition       0.816       0.804     52594.5740      37.4435    0.3319    
##    6       Gdp:freed        removal        0.040       0.000    274631.7410     118.0525    0.7502    
##    7       Gdp:socsup       addition       0.907       0.901     26550.3170       3.3040    0.2359    
##    8        generos2        removal        0.907       0.903     26619.8270       1.4383    0.2337    
##    9         freed2         addition       0.955       0.952     12855.8020     -32.8743    0.1643    
##   10        corrupt2        addition       0.972       0.969      8014.7430     -54.4141    0.1313    
##   11        generos         addition       0.975       0.972      7096.5840     -58.4794    0.1250    
##   12        lifeexp         addition       0.995       0.994      1438.7830    -135.3015    0.0575    
##   13        corrupt         addition       0.995       0.994      1329.7000    -137.2170    0.0559    
##   14         freed2         removal        0.983       0.981      4790.4410     -76.0216    0.1040    
##   15         freed          addition       0.995       0.994      1334.7140    -137.0335    0.0560    
##   16       Gdp:socsup       removal        0.881       0.864     34063.6790      21.7373    0.2763    
##   17      Gdp:dystopia      addition       0.984       0.981      4672.8270     -75.2764    0.1039    
##   18        corrupt2        removal        0.983       0.981      4784.6380     -76.0818    0.1039    
##   19        socsup2         addition       0.998       0.998       541.5820    -180.3458    0.0363    
##   20         socsup         addition       0.998       0.998       464.2370    -185.7622    0.0341    
##   21        lifeexp2        addition       0.998       0.998       463.5230    -184.0364    0.0345    
##   22         socsup         removal        0.998       0.998       509.4890    -181.3988    0.0357    
##   23         region         addition       0.999       0.998       277.8100    -195.6407    0.0295    
##   24        socsup2         removal        0.986       0.980      4042.3100     -68.4920    0.1059    
##   25     socsup:lifeexp     addition       0.999       0.998       298.5910    -192.3742    0.0305    
##   26      Gdp:dystopia      removal        0.976       0.967      6753.2910     -42.9892    0.1366    
##   27      Gdp:lifeexp       addition       0.999       0.999       240.3300    -202.1326    0.0277    
##   28    lifeexp:dystopia    addition       0.999       0.999       163.6070    -217.3425    0.0236    
##   29          Gdp           addition       0.999       0.999       149.5130    -219.7291    0.0230    
##   30    lifeexp:dystopia    removal        0.999       0.999       241.0210    -200.3755    0.0280    
##   31    lifeexp:corrupt     addition       0.999       0.999       166.5010    -215.1098    0.0241    
##   32          Gdp           removal        0.999       0.999       186.7670    -211.6371    0.0250    
##   33          GDP2          addition       0.999       0.999       184.1090    -210.7336    0.0251    
##   34    lifeexp:corrupt     removal        0.999       0.999       241.9640    -200.2004    0.0281    
##   35    socsup:dystopia     addition       0.999       0.999       178.1240    -212.1785    0.0248    
##   36        lifeexp         removal        0.999       0.999       194.3530    -209.9016    0.0255    
##   37    generos:dystopia    addition       0.999       0.999       161.4690    -216.4342    0.0238    
##   38     socsup:lifeexp     removal        0.999       0.998       300.5140    -190.3860    0.0310    
##   39     socsup:corrupt     addition       0.999       0.999       209.7390    -204.9787    0.0266    
##   40      Gdp:lifeexp       removal        0.999       0.999       223.0500    -203.8323    0.0271    
##   41     freed:corrupt      addition       0.999       0.999       187.6980    -209.8867    0.0254    
##   42    generos:dystopia    removal        0.999       0.999       198.5850    -208.9589    0.0257    
##   43    corrupt:dystopia    addition       0.999       0.999       175.3200    -212.8701    0.0246    
## --------------------------------------------------------------------------------------------------

Comment:

I added all 21 possible two-way interaction terms (7C2) and 7 possible quadratic terms, so in total the model had 35 predictors.
By stepwise selection the final model has 10 predictor variables:
1) 4 first-order variables: Generosity, Perceptions of corruption, Freedom to make life choices, GDP. The other 3 first-order predictors (Social support, Healthy life expectancy, Ladder score in Dystopia) were deleted.
2) 3 second-order terms: \(Ladder\; score\; in\; Dystopia^2\), \(Life\; expectancy^2\), \(GDP^2\).
3) 3 interaction terms: \(social\;support\times Ladder\; score\; in\; Dystopia\), \(social\; support\times corruption\), \(freedom\times corruption\).
I decided to keep all of these predictors and to add back the 3 deleted first-order predictors (Social support, Healthy life expectancy, Ladder score in Dystopia), because their added-variable plots show that they explain residual variability once the rest of the predictors are in the model.

Model selection criteria: Adjusted \(R^2\), AIC, BIC, \(C_p\)

1) Adjusted \(R^2\)
The adjusted \(R^2\) measures how well the model is expected to predict data not used to build it.
Unlike \(R^2\), the adjusted \(R^2\) decreases when irrelevant variables are added, because the penalty for extra parameters outweighs the negligible improvement in fit.
The larger the adjusted \(R^2\), the better the model.

2) Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
AIC favors models with small SSE but penalizes models with too many parameters p.
The lower the AIC, the better the model.
BIC is similar to AIC, but its penalty term is more severe (a factor of \(\log n\) instead of 2 per parameter).
The lower the BIC, the better the model.

3) Mallows’ \(C_p\)
\(C_p\) assesses the bias of the model and its fit to the data.
If \(C_p \approx p\), the reduced model predicts about as well as the full model.
If \(C_p < p\), the reduced model is estimated to be less biased than the full model.
The lower the \(C_p\), the better the model.
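As a minimal illustration (simulated data and hypothetical names, not the project data), all three criteria can be computed directly in R for a reduced model against a full one:

```r
# Toy comparison: x2 is pure noise, so the reduced model should win on
# adjusted R^2, AIC and BIC, and its Cp should sit near p.
set.seed(42)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)          # x2 is an irrelevant predictor

full <- lm(y ~ x1 + x2)
red  <- lm(y ~ x1)

summary(red)$adj.r.squared           # adjusted R^2: larger is better
c(AIC(red), AIC(full))               # lower AIC is better
c(BIC(red), BIC(full))               # BIC: heavier penalty (log n per parameter)

# Mallows' Cp of the reduced model, using the full model's MSE:
p  <- length(coef(red))              # parameters in the reduced model
cp <- sum(residuals(red)^2) / summary(full)$sigma^2 - (n - 2 * p)
cp                                   # Cp close to p suggests little bias
```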

#m12 = update(my_model, . ~ . -region+ GDP2 + lifeexp2+dystopia2)
#m13= update(m12, . ~ . +socsup*dystopia +socsup*corrupt+ freed*corrupt)

library(leaps)

allhapscore <- regsubsets( hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+GDP2 + lifeexp2+dystopia2+socsup*dystopia +socsup*corrupt+ freed*corrupt, nbest=4, data=My_Data)

all_output <- summary(allhapscore)
with(all_output, round(cbind(which, rsq, adjr2, cp, bic), 3))
##   (Intercept) Gdp socsup lifeexp freed generos corrupt dystopia GDP2 lifeexp2
## 1           1   0      0       0     0       0       0        0    1        0
## 1           1   1      0       0     0       0       0        0    0        0
## 1           1   0      0       0     0       0       0        0    0        0
## 1           1   0      0       0     0       0       1        0    0        0
## 2           1   0      0       0     0       0       0        0    1        0
## 2           1   1      0       0     0       0       0        0    0        0
## 2           1   0      0       0     0       0       0        0    0        1
## 2           1   0      0       1     0       0       0        0    0        0
## 3           1   1      0       0     0       0       1        0    0        0
## 3           1   0      0       0     0       0       1        0    1        0
## 3           1   1      0       0     0       0       0        0    0        0
## 3           1   0      0       0     0       0       0        0    1        0
## 4           1   0      0       0     1       0       0        0    1        0
## 4           1   0      0       0     1       0       1        0    1        0
## 4           1   1      0       0     1       0       0        0    0        0
## 4           1   1      0       0     1       0       1        0    0        0
## 5           1   0      0       0     1       0       0        0    1        1
## 5           1   0      0       1     1       0       0        0    1        0
## 5           1   1      0       0     1       0       0        0    0        1
## 5           1   1      0       1     1       0       0        0    0        0
## 6           1   0      0       0     1       1       0        0    1        1
## 6           1   0      0       1     1       1       0        0    1        0
## 6           1   1      0       0     1       1       0        0    0        1
## 6           1   1      0       1     1       1       0        0    0        0
## 7           1   1      1       1     1       1       0        1    0        0
## 7           1   1      1       1     1       1       1        1    0        0
## 7           1   1      1       1     1       1       0        1    0        0
## 7           1   0      1       1     1       1       0        1    1        0
## 8           1   1      1       1     1       1       1        1    0        0
## 8           1   1      1       1     1       1       0        1    0        0
## 8           1   1      1       1     1       1       0        1    1        0
## 8           1   1      1       1     1       1       0        1    0        0
##   dystopia2 socsup:dystopia socsup:corrupt freed:corrupt   rsq adjr2        cp
## 1         0               0              0             0 0.597 0.589 20447.298
## 1         0               0              0             0 0.580 0.571 21332.363
## 1         0               1              0             0 0.569 0.560 21869.964
## 1         0               0              0             0 0.540 0.530 23367.097
## 2         0               1              0             0 0.909 0.905  4608.316
## 2         0               1              0             0 0.904 0.900  4818.740
## 2         0               1              0             0 0.831 0.824  8534.126
## 2         0               1              0             0 0.826 0.819  8802.789
## 3         0               1              0             0 0.967 0.965  1623.003
## 3         0               1              0             0 0.967 0.965  1625.432
## 3         0               1              1             0 0.960 0.958  1979.901
## 3         0               1              1             0 0.959 0.957  2029.521
## 4         0               1              0             1 0.984 0.983   759.372
## 4         0               1              0             0 0.984 0.983   766.399
## 4         0               1              0             1 0.984 0.982   788.145
## 4         0               1              0             0 0.984 0.982   792.267
## 5         0               1              1             0 0.992 0.991   372.636
## 5         0               1              1             0 0.992 0.991   386.580
## 5         0               1              1             0 0.991 0.990   411.214
## 5         0               1              1             0 0.991 0.990   426.535
## 6         0               1              1             0 0.997 0.997    96.523
## 6         0               1              1             0 0.997 0.997   107.272
## 6         0               1              1             0 0.997 0.997   112.795
## 6         0               1              1             0 0.997 0.996   124.956
## 7         0               0              0             1 0.999 0.999     4.992
## 7         0               0              0             0 0.999 0.999     5.007
## 7         0               0              1             0 0.999 0.999     6.882
## 7         0               0              0             1 0.999 0.999     8.375
## 8         0               0              0             1 0.999 0.999     5.891
## 8         1               0              0             1 0.999 0.999     6.188
## 8         0               0              0             1 0.999 0.999     6.268
## 8         0               1              0             1 0.999 0.999     6.338
##        bic
## 1  -37.615
## 1  -35.501
## 1  -34.259
## 1  -30.955
## 2 -107.840
## 2 -105.628
## 2  -77.247
## 2  -75.705
## 3 -155.304
## 3 -155.232
## 3 -145.594
## 3 -144.382
## 4 -188.080
## 4 -187.643
## 4 -186.312
## 4 -186.064
## 5 -217.474
## 5 -215.805
## 5 -212.985
## 5 -211.308
## 6 -270.110
## 6 -266.210
## 6 -264.319
## 6 -260.391
## 7 -327.367
## 7 -327.348
## 7 -325.001
## 7 -323.208
## 8 -324.888
## 8 -324.498
## 8 -324.392
## 8 -324.301

Comment:

By the model selection criteria the best model has the lowest BIC = -327.367, the lowest \(C_p\) = 4.992 and the highest adjusted \(R^2\) = 0.999. This model has 7 predictor variables: GDP, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Ladder score in Dystopia, and one interaction term, freedom*corruption.

Note:

After examining the model selection criteria, constructing the various plots and applying the stepwise selection method, I decided to use a linear model for Y (happiness score) with 8 predictor variables: GDP, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Ladder score in Dystopia, and freedom*corruption.

4) Residual analysis

Usually we do not run diagnostics on the response variable Y directly, because Y = f(\(x_1\),\(x_2\),…,\(x_k\)), where \(x_i\) is the i-th predictor variable, i = 1,2,…,k. Diagnostics of Y are instead carried out indirectly through residual analysis, since \(e_i=Y_i-\hat Y_i\).
We first need the basic assumptions on which multiple linear regression theory is built.

Assumptions of MLR

\(\epsilon_1,\epsilon_2,\ldots,\epsilon_n \stackrel{iid}{\sim} N(0,\sigma^2)\)
A linear relationship between E[Y] and associated predictors \(x_1\), . . . , \(x_k\) .

If the model is appropriate for the given data, the residuals \(e_i\) should reflect the assumed properties for the error terms \(\epsilon_i\).

After fitting the model and before drawing any conclusions we need to check:
1. Normality of residuals
2. Homogeneity of variance of residuals
3. Linearity of Y
4. Independence of residuals
5. The predictor variables are not too highly correlated with each other.
6. There are no influential outliers

4.1) Checking independence of residuals

a) Sequence plot of \(e_i\) vs \(i\)

Plotting \(e_i\) against its index \(i\), i.e. the residuals versus their order, reveals any correlation between successive error terms, so it allows us to check the independence of the residuals.

library(lawstat)
## 
## Attaching package: 'lawstat'
## The following object is masked from 'package:car':
## 
##     levene.test
library(latex2exp)

my_model_fin=lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+corrupt*freed, data=My_Data)
re = my_model_fin$residuals


plot(re, main="Graph 3: Time Series Plot of the Residuals", xlab=TeX("i"), ylab=TeX("e_i"))
lines(re)
abline(h=0)

b) Runs Test for Independence


\(H_0:\; e_i\; are\; independent\; i=1,2,...,n\)
\(H_a:\; e_i\; are\; not\; independent\; i=1,2,...,n\)
R is used to calculate the p.m.f. of the runs statistic U.
\(p-value=\Pr(U\leq u)\)

NOTE: For all tests I will use significance level \(\alpha=0.05\) (i.e. a 95% confidence level).

 library(lawstat)
 runs.test(re, plot.it=TRUE)

## 
##  Runs Test - Two sided
## 
## data:  re
## Standardized Runs Statistic = 0, p-value = 1

Note

Here A denotes the non-negative residuals (\(e_i\geq0\)), B the negative residuals (\(e_i<0\)), and u = 25 runs were observed; the p-value is 1, i.e. 100%.

Comment

According to Graph 3 the residuals are independent: there is no repeated pattern, and I see no clear relationship between \(e_i\) and \(i\); no \(e_j\) can be predicted from the previous residuals.
The Runs Test agrees: its two-sided p-value of 1 (100%) exceeds \(\alpha=0.05\), and indeed any commonly used \(\alpha\) (usually between 1% and 10%), so we fail to reject \(H_0:\; e_i\; are\; independent,\; i=1,2,...,n\). Graph 3 and the Runs Test for Independence therefore lead to the same conclusion: the residuals are independent.
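For intuition, the runs count u on which the test is based can be computed by hand from the sign pattern of the residuals. A self-contained sketch on simulated stand-in residuals (not the model’s `re`):

```r
# A "run" is a maximal streak of same-sign residuals; too few or too many
# runs relative to what independence predicts signal dependence.
set.seed(1)
e <- rnorm(50)                                    # stand-in residuals
signs <- e >= 0                                   # A: e_i >= 0, B: e_i < 0
u <- 1 + sum(signs[-1] != signs[-length(signs)])  # number of runs
u
```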

4.2) Checking normality of residuals

To check the normality of residuals we can construct a Normal Q-Q plot, in which each residual is plotted against its expected value under normality.
Points lying near or on a straight line suggest that the residuals are normally distributed, whereas points departing from the line suggest that they are not.

a) Normal Q-Q Plot

qqnorm(re)
qqline(re)

hist(re, xlab='residuals', main='Graph 4: Histogram of residuals', freq=FALSE)
curve(dnorm(x, mean(re), sd(re)), add=TRUE)

Comment

From the Normal Q-Q plot it is difficult to determine whether the residuals are normally distributed. Residuals with theoretical quantiles between -1 and 1 lie very close to the reference line, so they look normal, but residuals with theoretical quantiles beyond \(\pm1\) depart increasingly far from it.
According to Graph 4 the residuals look approximately normal, as the histogram has a bell shape, though the distribution is slightly left-skewed (a small left tail).
Because the plots are inconclusive, I conducted the Shapiro-Wilk test to decide whether the residuals are normally distributed.

b) Shapiro-Wilk Test for Normality


\(H_0: e_1,e_2,...,e_n\sim N\)
\(H_a: e_1,e_2,...,e_n\nsim N\)

Test statistics W and p-value is calculated by R.

shapiro.test(re)
## 
##  Shapiro-Wilk normality test
## 
## data:  re
## W = 0.61696, p-value = 3.507e-10

Comment

By the Shapiro-Wilk test for normality, W = 0.61696 and the p-value = \(3.507\times10^{-10}\), i.e. approximately 0%. At significance level \(\alpha=0.05\) (5%), since the p-value \(<\alpha\), we reject \(H_0: e_1,e_2,...,e_n\sim N\) in favor of \(H_a\): the residuals are not normally distributed.

4.3) Checking Homogeneity of Variance and linearity of \(\hat{Y}_{i}\)

a) Residuals vs Fitted values plot for checking Homogeneity of Variance and linearity of \(\hat{Y}_{i}\)

To check the linearity and homogeneity-of-variance assumptions, we can plot the residuals \(e_i\) against the fitted values \(\hat Y_i\). If the variance of the residuals is homogeneous, we expect a constant spread of the residuals around the 0 line across all \(\hat Y_i\) values. If the linear model is a good fit, we expect the residuals to be spread evenly on either side of the 0 line.

fits = my_model_fin$fitted.values

plot(re ~ fits,ylim=c(-0.05,0.05), main="Graph 5: Residuals vs Fitted values", xlab=TeX("\\hat{Y}_i "),
     ylab=TeX("e_i "))
#scatter.smooth(re ~ fits, main="Residuals vs Fitted values", xlab=TeX("\\hat{Y}_i "), ylab=TeX("e_i"))
abline(h=0)

Comment

Firstly, according to Graph 5 there may be a parabolic relationship between \(e_i\) and \(\hat Y_i\), as the residuals are not evenly spread around the y = 0 line; this suggests a nonlinear relation between Y and \(x_1,x_2,...,x_8\). Secondly, the residuals neither fan out from nor funnel toward y = 0 as \(\hat Y_i\) increases, so the variance of the residuals is homogeneous.

b) Levene’s Test for Homogeneity of Variance

Split responses (happiness score) into t distinct groups based on predictor values (GDP per capita)
\(H_0: \sigma_1^2=...=\sigma_t^2\)
\(H_a:\) not all \(\sigma_i^2\) are equal
\(Test\; Statistic\;(T.S.)\sim F_{t-1,\,n-t}\)
where t is the number of groups and n is the number of observations
The \(p-value=\Pr(F_{t-1,\,n-t}\geq T.S.)\)

library(lawstat)
#Look at plot Graph 1 to divide data into 3 groups
breaks = c(8, 9,10, 12)
groups = cut(Gdp, breaks)
levene.test(re, groups)
## 
##  Modified robust Brown-Forsythe Levene-type test based on the absolute
##  deviations from the median
## 
## data:  re
## Test Statistic = 0.54928, p-value = 0.581

Comment

To apply Levene’s test I divided the data into 3 groups [\(8\leq GDP<9\), \(9\leq GDP<10\), \(10\leq GDP<12\)] based on the GDP predictor. The computed p-value = 0.581, or 58.1%, is greater than \(\alpha=0.05\) (5%), and indeed greater than any commonly used \(\alpha\) (between 1% and 10%). So we fail to reject \(H_0: \sigma_1^2=...=\sigma_t^2\), which means the variance of the residuals is constant/homogeneous, the same conclusion as from Graph 5 (Residuals vs Fitted values).
Note that the computed p-value depends on how the data are divided into groups: both the interval boundaries and the number of intervals/groups.

4.4) Outliers

Studentized deleted residuals

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
## 
##     cement
#Compute the studentized delted residuals
stud.del.res <- studres(my_model_fin)
print(stud.del.res)
##             1             2             3             4             5 
##  -1.348584716  -0.381198858   0.160782908  -0.146475114   0.488515870 
##             6             7             8             9            10 
##  -0.328987531  -0.090739330   0.395778947   0.291583961   0.282675555 
##            11            12            13            14            15 
##   0.530326432   0.549005610   0.652611931   0.156666039   0.621176616 
##            16            17            18            19            20 
##   0.350273584   0.526139413  -0.432639572   1.965599849  -0.268210053 
##            21            22            23            24            25 
##  -0.286285476  -0.433784527   0.513166122   0.016485114  -0.584524003 
##            26            27            28            29            30 
##  -0.529138874   0.219122854   0.293306297  -0.233357910   0.162354143 
##            31            32            33            34            35 
##  -0.008648775  -0.316575228  -0.118426326   0.171064698   0.008750378 
##            36            37            38            39            40 
##   0.987168299 -78.681573781  -0.368297800  -0.031418401  -0.645481346 
##            41            42            43            44            45 
##   1.413019328  -0.550967177   0.044865465   0.852585266   0.235144025 
##            46            47            48            49            50 
##  -0.259758149   0.628112241   0.588333880  -0.187605154   0.504442809
#Calculate the Bonferroni critical value for outliers:

alpha = 0.05
t <- qt(1-alpha/(2*length(stud.del.res)),
                 length(stud.del.res)-9-1)
                  
sprintf("The critical value is t= %s", t)
## [1] "The critical value is t= 3.55096576086335"
paste0("Any studentized residuals > ",t," ?  ",
       any(abs(stud.del.res) > t))
## [1] "Any studentized residuals > 3.55096576086335 ?  TRUE"

Comment

The Bonferroni critical value for the studentized deleted residuals is t = 3.550966. Comparing the absolute studentized deleted residuals with this critical value flags one outlier: observation 37, whose |studentized deleted residual| = 78.68 exceeds 3.551. So \(Y_{37}\) is an outlier.

Leverage

diagonal <- lm.influence(my_model_fin)$hat
print("The diagonal elements of the hat matrix")
## [1] "The diagonal elements of the hat matrix"
print(diagonal)
##          1          2          3          4          5          6          7 
## 0.37450674 0.18513628 0.09777545 0.19337525 0.12145252 0.17398166 0.13927137 
##          8          9         10         11         12         13         14 
## 0.12235626 0.06612678 0.12963523 0.07609614 0.18530789 0.17906369 0.16011798 
##         15         16         17         18         19         20         21 
## 0.19421057 0.14370538 0.23756278 0.12367323 0.25655680 0.13195533 0.09190666 
##         22         23         24         25         26         27         28 
## 0.31177785 0.07900939 0.05365264 0.15854458 0.18691244 0.13473540 0.06063661 
##         29         30         31         32         33         34         35 
## 0.18073585 0.16710458 0.19470378 0.07931640 0.16929044 0.16238816 0.22395241 
##         36         37         38         39         40         41         42 
## 0.16500939 0.28477023 0.08154887 0.09363863 0.25039959 0.34557368 0.08775988 
##         43         44         45         46         47         48         49 
## 0.34423890 0.28777524 0.21189543 0.39623731 0.15889273 0.35668983 0.22370615 
##         50 
## 0.16532963
paste0("Any h_ii  > ",2*mean(diagonal)," ?  ",
       any(diagonal > 2*mean(diagonal)))
## [1] "Any h_ii  > 0.36 ?  TRUE"

Comment

By the rule of thumb, any leverage \(h_{ii}>2\bar{h}\) is flagged as having “high” leverage. In my project \(2\bar{h}=0.36\), and two observations exceed it: observation 1 (\(h_{1,1}=0.37450674\)) and observation 46 (\(h_{46,46}=0.39623731\)). Because these leverages are high, the model may be extrapolating far outside the general region of my data at these points.
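The \(2\bar{h}\) rule rests on the fact that the leverages are the diagonal of the hat matrix and sum to the number of parameters p, so \(\bar h = p/n\) and the cutoff equals \(2p/n\). A self-contained sketch on a simulated design matrix (not the project data):

```r
# Hat matrix H = X (X'X)^{-1} X'; its diagonal entries are the leverages.
set.seed(7)
n <- 50
X <- cbind(1, matrix(rnorm(n * 3), n, 3))  # toy design: intercept + 3 predictors
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)

sum(h)                    # equals p = 4, the number of parameters
mean(h)                   # p/n, so the cutoff 2*mean(h) is 2p/n
which(h > 2 * mean(h))    # indices flagged as high leverage
```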

Cook’s distance

#g) Calculate Cook's distance D; for each case and prepare an index plot. Are any cases
#influential according to this measure?

table <- data.frame(My_Data$`Ladder score`,cooks.distance(my_model_fin),pf(cooks.distance(my_model_fin),9,length(My_Data$`Ladder score`)-9))
colnames(table) = c("Y", "Cooks distance", "F percentile")
head(table)
##       Y Cooks distance F percentile
## 1 7.842   0.1186217970 9.650723e-04
## 2 7.620   0.0037464181 2.828066e-10
## 3 7.571   0.0003188568 4.396187e-15
## 4 7.554   0.0005854712 6.763173e-14
## 5 7.464   0.0037350515 2.789801e-10
## 6 7.392   0.0025892886 5.392405e-11
index=1:length(My_Data$`Ladder score`)
plot(index,cooks.distance(my_model_fin), type="l", col="red", lwd=3, xlab="Case index", ylab="Cook's distance", main="Index plot")

paste0("Any F percentile  > ",0.5," ?  ",
       any(pf(cooks.distance(my_model_fin),9,length(My_Data$`Ladder score`)-9) > 0.5))
## [1] "Any F percentile  > 0.5 ?  TRUE"

Comment

To detect a high Cook’s distance we compare each \(D_i\) to the \(F(p,n-p)\) distribution, where n is the sample size and p is the number of \(\beta\)’s: if the F percentile of \(D_i\) exceeds 0.5, that Cook’s distance is considered high.
By the table of Cook’s distances and the index plot there is one influential case: case 37. Its Cook’s distance is 1.802160 and its F percentile = 0.9027 > 0.5, so case 37 has a disproportionate influence on the fitted regression surface as a whole.

DFFITs

table2 <- data.frame(My_Data$`Ladder score`,dffits(my_model_fin))
colnames(table2) = c("Y", "DFFIT")
head(table2)
##       Y       DFFIT
## 1 7.842 -1.04351030
## 2 7.620 -0.18169998
## 3 7.571  0.05292946
## 4 7.554 -0.07171806
## 5 7.464  0.18163491
## 6 7.392 -0.15098588

Comment

Table 2 of DFFITS shows that observation 37 has a very large absolute DFFITS value, 49.647517, relative to the other observations, whose absolute DFFITS lie in the range [0, 1.1]. Since the DFFITS of observation 37 also exceeds 1, observation 37 is an outlier.
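Besides the |DFFITS| > 1 rule used above, a common size-adjusted cutoff is \(2\sqrt{p/n}\). A self-contained sketch (simulated data with one planted outlier, not the project data):

```r
# One grossly shifted response should exceed the 2*sqrt(p/n) DFFITS cutoff.
set.seed(3)
n <- 40
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[n] <- y[n] + 10                    # plant a gross outlier at case 40

fit <- lm(y ~ x)
d <- dffits(fit)
p <- length(coef(fit))
cutoff <- 2 * sqrt(p / n)
which(abs(d) > cutoff)               # the planted case is flagged
```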

Note

By computing the studentized deleted residuals, leverages, Cook’s distances and DFFITS, 3 outliers were detected: case 37, case 1 and case 46. They may affect the fitted regression function more than other points. If outlying points follow the modeling assumptions and are representative, they may strengthen inference and reduce prediction error; if not, they may badly skew inference and yield a model with poor predictive properties.

4.5) Multicollinearity

VIF

library(car)

vif(my_model_fin)
##           Gdp        socsup       lifeexp         freed       generos 
##      3.178866      3.076378      3.159432     49.048744      1.786673 
##       corrupt      dystopia freed:corrupt 
##    373.473056      1.551261    274.017663
#mod=update(my_model_fin,. ~ . -freed*corrupt+freed+corrupt)
#vif(mod)

Comment

The VIFs show that corruption, freedom and \(freedom\times corruption\) are involved in severe multicollinearity, as their VIFs exceed 10. I expected this, since \(freedom\times corruption\) is an interaction term built from corruption and freedom.
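A standard remedy for this kind of structural multicollinearity (not applied in this report; shown only as a sketch on simulated stand-ins for freedom and corruption) is to mean-center the components before forming the interaction. VIFs are computed manually here as \(1/(1-R_j^2)\):

```r
# Manual VIF: regress each column on the others and invert 1 - R^2.
vif_manual <- function(X) sapply(seq_len(ncol(X)), function(j)
  1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared))

set.seed(5)
f  <- runif(50, 0.5, 1.0)            # stand-in for freedom (positive scale)
cr <- runif(50, 0.3, 0.9)            # stand-in for corruption

X_raw <- cbind(f, cr, f * cr)        # raw terms plus interaction
fc <- f - mean(f); cc <- cr - mean(cr)
X_ctr <- cbind(fc, cc, fc * cc)      # centered terms plus interaction

vif_manual(X_raw)                    # interaction inflates the raw-term VIFs
vif_manual(X_ctr)                    # after centering, VIFs sit near 1
```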

4.6) Added variable plots

From Graph 5 (Residuals \(e_i\) vs Fitted values \(\hat Y_i\)) I concluded that there is a parabolic relationship between Y and \(x_1\), \(x_2\), …, \(x_8\), so I may need to transform the predictors to fix this violation. Added-variable plots help reveal which transformation should be applied to which predictor.

My_Data$corruptfreed=My_Data$`Perceptions of corruption`*My_Data$`Freedom to make life choices`

#Added var plot for x1
fit11 = lm(hapscore~ socsup+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx1=fit11$residuals

fit12 = lm(Gdp ~ socsup+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex1onwithoutx1=fit12$residuals
#plot(reYonithoutx1 ~ rex1onwithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)", xlab="e(x1|others)", ylab="e(Y|Y on without x1) ")
scatter.smooth(x=rex1onwithoutx1, y=reYonithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)",xlab="e(x1|others)", ylab="e(Y|others without x1)") 

#Added var plot for x2
fit21 = lm(hapscore~ Gdp+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx2=fit21$residuals

fit22 = lm(socsup~ Gdp+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex2onwithoutx2=fit22$residuals
#plot(reYonithoutx2 ~ rex2onwithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2) ")
scatter.smooth(x=rex2onwithoutx2, y=reYonithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2) ") 

#Added var plot for x3
fit31 = lm(hapscore~ Gdp+socsup+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx3=fit31$residuals

fit32 = lm(lifeexp~ Gdp+socsup+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex3onwithoutx3=fit32$residuals

#plot(reYonithoutx3 ~ rex3onwithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3) ")
scatter.smooth(x=rex3onwithoutx3, y=reYonithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3) ") 

#Added var plot for x4
fit41 = lm(hapscore~ Gdp+socsup+lifeexp+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx4=fit41$residuals

fit42 = lm(freed~ Gdp+socsup+lifeexp+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex4onwithoutx4=fit42$residuals

#plot(reYonithoutx4 ~ rex4onwithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choice)", xlab="e(x4|others)", ylab="e(Y|others without x4) ")

scatter.smooth(x=rex4onwithoutx4, y=reYonithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choice)", xlab="e(x4|others)", ylab="e(Y|others without x4) ")

#Added var plot for x5
fit51 = lm(hapscore~ Gdp+socsup+lifeexp+freed+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx5=fit51$residuals

fit52 = lm(generos~ Gdp+socsup+lifeexp+freed+corrupt+dystopia+corruptfreed, data=My_Data)
rex5onwithoutx5=fit52$residuals

#plot(reYonithoutx5 ~ rex5onwithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5) ")

scatter.smooth(x=rex5onwithoutx5, y=reYonithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5) ")

#Added var plot for x6
fit61 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+dystopia+corruptfreed, data=My_Data)
reYonithoutx6=fit61$residuals

fit62 = lm(corrupt~ Gdp+socsup+lifeexp+freed+generos+dystopia+corruptfreed, data=My_Data)
rex6onwithoutx6=fit62$residuals

#plot(reYonithoutx6 ~ rex6onwithoutx6, main="Graph 6: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6) ")

scatter.smooth(x=rex6onwithoutx6, y=reYonithoutx6, main="Graph 6: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6) ")

#Added var plot for x7
fit71 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+corruptfreed, data=My_Data)
reYonithoutx7=fit71$residuals

fit72 = lm(dystopia~ Gdp+socsup+lifeexp+freed+generos+corrupt+corruptfreed, data=My_Data)
rex7onwithoutx7=fit72$residuals

#plot(reYonithoutx7 ~ rex7onwithoutx7, main="Graph 7: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7) ")

scatter.smooth(x=rex7onwithoutx7, y=reYonithoutx7, main="Graph 7: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7) ")

#Added var plot for x8
fit81 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx8=fit81$residuals

fit82 = lm(corruptfreed~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex8onwithoutx8=fit82$residuals

#plot(reYonithoutx8 ~ rex8onwithoutx8, main="Graph 8: Added-Variable Plot for x8 (corruption*freedom)", xlab="e(x8|others)", ylab="e(Y|others without x8) ")

scatter.smooth(x=rex8onwithoutx8, y=reYonithoutx8, main="Graph 8: Added-Variable Plot for x8 (corruption*freedom)", xlab="e(x8|others)", ylab="e(Y|others without x8) ")

Comment:

The 7 added-variable plots for GDP, social support, life expectancy, freedom to make life choices, generosity, ladder score in Dystopia and perceptions of corruption show a strong linear relationship between residuals(\(Y\)|\(x_{-j}\)) and residuals(\(x_{j}\)|\(x_{-j}\)). So each of these \(x_j\) explains residual variability once the rest of the predictors are in the model, and because the relationship is linear, no transformation of these predictors is needed.
However, Graph 8 (the Added-Variable Plot for \(x_8\), corruption*freedom) shows a horizontal line, which means \(x_8\) does not explain any residual variability once the rest of the predictors are in the model.
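The hand-built plots above rely on the Frisch–Waugh–Lovell property: the slope in the added-variable plot for \(x_j\) equals \(\hat\beta_j\) from the full model (with the car package, already loaded for vif(), avPlots() draws all panels in one call). A self-contained check on simulated data with hypothetical names:

```r
set.seed(9)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(30)
m <- lm(y ~ x1 + x2, data = d)

# residuals of y and of x1, each regressed on the remaining predictor x2
ry <- residuals(lm(y ~ x2, data = d))
rx <- residuals(lm(x1 ~ x2, data = d))

# the added-variable slope reproduces the full-model coefficient of x1
coef(lm(ry ~ rx))[2]
coef(m)["x1"]
```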

5) Remedial Measures

Remedial measures are procedures we can apply to fix problems of a regression model such as a nonlinear relation, non-constant variance, non-independence of errors, non-normality of errors, and outliers.

According to the results of the tests and the analysis of the plots, my regression model satisfies only 2 of the basic MLR assumptions: homogeneity of variance of the residuals and independence of the residuals.
The diagnostics revealed 3 violations: the residuals are not normally distributed; there is a nonlinear relationship between Y and \(x_1\),\(x_2\),…,\(x_8\); and 3 predictors (corruption, freedom, \(freedom\times corruption\)) are involved in severe multicollinearity.
The added-variable plots show that \(x_8\) (\(corruption\times freedom\)) does not explain any residual variability once the rest of the predictors are in the model.
The studentized deleted residuals, leverages, Cook’s distances and DFFITS detected 3 outliers: case 37, case 1 and case 46; every method except leverage flagged case 37.

To fix these violations I decided to remove the \(x_8\) (\(corruption\times freedom\)) predictor and observation 37 from the model.

A new model

I created new sheet in my Excel table and manually deleted observation 37 from it. Also, I removed \(x_8\) (\(corruption\times freedom\)) predictor from the model.

require(foreign)
## Loading required package: foreign
require(MASS)
library("readxl")
library("plyr")


new_Data <- read_excel("Data2.xlsx",sheet=2,range="B2:K51", col_names = TRUE)

hapscore_new<-new_Data$`Ladder score`
Gdp_new<-new_Data$`GDP per capita`
socsup_new<-new_Data$`Social support`
lifeexp_new<-new_Data$`Healthy life expectancy`
freed_new<-new_Data$`Freedom to make life choices`
generos_new<-new_Data$`Generosity`
corrupt_new<-new_Data$`Perceptions of corruption`
dystopia_new<-new_Data$`Ladder score in Dystopia +residual`

irls= lm(hapscore_new~ Gdp_new+socsup_new+lifeexp_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
summary(irls)
## 
## Call:
## lm(formula = hapscore_new ~ Gdp_new + socsup_new + lifeexp_new + 
##     freed_new + generos_new + corrupt_new + dystopia_new, data = new_Data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0074348 -0.0008362  0.0002282  0.0009887  0.0042172 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.556e+00  7.508e-03  -606.8   <2e-16 ***
## Gdp_new       3.498e-01  5.120e-04   683.1   <2e-16 ***
## socsup_new    2.239e+00  5.983e-03   374.2   <2e-16 ***
## lifeexp_new   3.148e-02  9.204e-05   342.0   <2e-16 ***
## freed_new     1.220e+00  4.272e-03   285.6   <2e-16 ***
## generos_new   6.574e-01  2.766e-03   237.7   <2e-16 ***
## corrupt_new  -6.372e-01  1.703e-03  -374.1   <2e-16 ***
## dystopia_new  9.991e-01  9.521e-04  1049.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001828 on 41 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.168e+06 on 7 and 41 DF,  p-value: < 2.2e-16

Comment:

By the t-test we reject \(H_0\): \(\beta_{i}=0\) for every predictor in the new model, since each p-value < \(2\times10^{-16}\) < 0.05.
By the F-test, which tests whether all \(\beta_{i}=0\), we reject \(H_0\): \(\beta_{1}=\beta_{2}=...=\beta_{7}=0\), since the p-value < \(2\times10^{-16}\) < \(\alpha=0.05\). So there is at least one nonzero coefficient, and the F-test for the full model and the t-test for each predictor in the presence of the other predictors lead to the same conclusion: there are nonzero \(\beta\)’s.
Also, the adjusted \(R^2=1\), which suggests the model will predict data not used to build it very well, and the MSE is essentially 0.
So \(\hat{Y}_{i} = -4.556+0.349*X_{i,1}+ 2.239*X_{i,2}+0.031*{X}_{i,3} +1.22*{X}_{i,4}+0.657*{X}_{i,5} -0.637*{X}_{i,6}+ 0.999*{X}_{i,7}\).
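As a quick sketch of how this fitted equation is used, the rounded coefficients above can be applied to a hypothetical country's predictor values (the X values below are made up purely for illustration, not taken from the data):

```r
# Evaluating the fitted equation by hand with the rounded coefficients above.
# The predictor values in x are hypothetical, chosen only for illustration.
b <- c(-4.556, 0.349, 2.239, 0.031, 1.220, 0.657, -0.637, 0.999)
x <- c(1, 10.5, 0.95, 70, 0.90, 0.10, 0.40, 2.43)  # 1 for the intercept, then x1..x7
y_hat <- sum(b * x)   # predicted happiness score
round(y_hat, 3)       # about 6.742
```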

Note

After changing my model, I will run the diagnostics again to see whether the new model satisfies the basic assumptions of MLR.

5.1) Residual analysis: homogeneity, independence, normality of residuals and linearity between Y and \(x_1,\ldots,x_7\)

# 1) Independence of residuals

library(lawstat)
library(latex2exp)

re_new = irls$residuals

#a) plot: e_i vs i
  
plot(re_new, main="Graph 9: Time Series Plot of the Residuals", xlab=TeX("i"), ylab=TeX("e_i"))
lines(re_new)
abline(h=0)

#b) Runs Test for Independence

 library(lawstat)
 runs.test(re_new, plot.it=TRUE)

## 
##  Runs Test - Two sided
## 
## data:  re_new
## Standardized Runs Statistic = 0.43624, p-value = 0.6627
# 2) Checking normality of residuals

#a) Normal Q-Q Plot

qqnorm(re_new,ylim=c(-0.01,0.01))
qqline(re_new)

#b) Histogram of residuals
hist(re_new, xlab='residuals', main='Graph 10: Histogram of residuals', freq=FALSE)
curve(dnorm(x, mean(re_new), sd(re_new)), add=TRUE)

#c) Shapiro-Wilk Test for Normality
shapiro.test(re_new)
## 
##  Shapiro-Wilk normality test
## 
## data:  re_new
## W = 0.8808, p-value = 0.0001374
# 3) Checking homogeneity of variance and linearity of $\hat{Y}_{i}$

fits = irls$fitted.values
plot(re_new ~ fits, ylim=c(-0.01,0.01), main="Graph 11: Residuals vs Fitted values", xlab=TeX("\\hat{Y}_i "), ylab=TeX("e_i "))
abline(h=0)

#b) Levene’s Test for Homogeneity of Variance

library(lawstat)
#Look at plot Graph 1 to divide data into 3 groups
breaks = c(8, 9,10, 12)
groups = cut(Gdp_new, breaks)
levene.test(re_new, groups)
## 
##  Modified robust Brown-Forsythe Levene-type test based on the absolute
##  deviations from the median
## 
## data:  re_new
## Test Statistic = 0.8438, p-value = 0.4366

Comment

1) Independence of residuals

According to Graph 9: Time Series Plot of the Residuals, the residuals are independent: there is no repeated pattern, and I do not see a clear relationship between \(e_i\) and \(i\). No \(e_j\) can be predicted from the previous residuals, so there is no dependence.
Also, the runs test fails to reject \(H_0:\; e_i\; are\; independent,\; i=1,2,...,n\), since the p-value of the two-sided runs test is 0.6627 (66.27%), which is far greater than \(\alpha=0.05\) and indeed greater than any commonly used \(\alpha\) (usually \(1\% \leq\alpha\leq 10\%\)). So we retain \(H_0\), and Graph 9 and the runs test lead to the same conclusion: the residuals are independent.

2) Normality of residuals

From the normal Q-Q plot it is difficult to decide whether the residuals are normally distributed: the residuals with theoretical quantiles between -1.5 and 1.5 lie on or very close to the reference line, consistent with normality, while the residuals with theoretical quantiles beyond \(\pm 1.5\) depart increasingly from it.
Graph 10: Histogram of Residuals has a roughly bell-shaped form, but the distribution is negatively skewed, with a longer left tail.
So I conducted the Shapiro-Wilk test to decide whether the residuals are normally distributed. The test gives W = 0.8808 and p-value = 0.0001374 (about 0.01%). Since p-value = 0.0001374 < \(\alpha=0.05\), I reject \(H_0: e_1,e_2,...,e_n\sim N\) and accept \(H_a\): the residuals are not normally distributed.

3) Homogeneity of Variance and linearity of \(\hat{Y}_{i}\)

First, according to Graph 11: Residuals vs Fitted values, there is no clear pattern between \(e_i\) and \(\hat Y_i\); the residuals are spread evenly on either side of the zero line, so the relationship between Y and \(x_1,\ldots,x_7\) is linear. Second, the residuals neither spread away from nor shrink toward \(y=0\) as \(\hat Y_i\) increases, so the variance of the residuals is homogeneous.
For Levene’s test I divided the data into 3 groups [\(8\leq GDP<9\), \(9\leq GDP<10\), \(10\leq GDP<12\)] based on the predictor GDP. The calculated p-value = 0.4366 (43.66%) is greater than \(\alpha=0.05\) and indeed greater than any commonly used \(\alpha\) (usually \(1\% \leq\alpha\leq 10\%\)). So we fail to reject \(H_0: \sigma_1^2=...=\sigma_t^2\): the variance of the residuals is constant/homogeneous, which is the same conclusion as from Graph 11.

5.2) Outliers

Studentized deleted residuals

library(MASS)

#Compute the studentized deleted residuals
stud.del.res <- studres(irls)
print(stud.del.res)
##           1           2           3           4           5           6 
##  1.37853191  1.44257055 -0.22329394  0.41614910 -0.42731904 -1.29208499 
##           7           8           9          10          11          12 
## -0.30308858 -0.48334884  0.32019755  0.30333339  0.30650499 -0.21404107 
##          13          14          15          16          17          18 
##  0.49038819 -5.58297177  0.80430395 -1.03439399  0.29997286 -0.39757212 
##          19          20          21          22          23          24 
##  3.00933372  1.16524367  0.76054440  0.39172197 -0.39633936  0.60236250 
##          25          26          27          28          29          30 
##  0.15849904 -0.38297904 -0.56620787 -1.20048595 -0.51007232  1.67819486 
##          31          32          33          34          35          36 
## -0.88149730  0.12836370  0.67187034 -0.50111583  0.09714979  0.16210621 
##          37          38          39          40          41          42 
## -0.02608556 -0.19853905  0.74944596 -0.77263544 -0.96330073 -0.59465185 
##          43          44          45          46          47          48 
##  0.68894665 -0.43254825 -1.08390061  0.46971753  0.20196557  0.60395332 
##          49 
##  0.66009937
#Calculate the Bonferroni critical value for outliers:

alpha = 0.05
t <- qt(1-alpha/(2*length(stud.del.res)),
                 length(stud.del.res)-8-1)
                  
sprintf("The critical value is t= %s", t)
## [1] "The critical value is t= 3.54395250327835"
paste0("Any studentized residuals > ",t," ?  ",
       any(abs(stud.del.res) > t))
## [1] "Any studentized residuals > 3.54395250327835 ?  TRUE"

Leverage

diagonal <- lm.influence(irls)$hat
print("The diagonal elements of the hat matrix")
## [1] "The diagonal elements of the hat matrix"
print(diagonal)
##          1          2          3          4          5          6          7 
## 0.34660147 0.14322717 0.09759096 0.18386491 0.11300372 0.09791491 0.10487565 
##          8          9         10         11         12         13         14 
## 0.12152539 0.05680724 0.12910794 0.07859997 0.14412735 0.09769284 0.07927742 
##         15         16         17         18         19         20         21 
## 0.16549908 0.14440774 0.24189521 0.12296642 0.29722135 0.13474573 0.09301971 
##         22         23         24         25         26         27         28 
## 0.30423764 0.08293250 0.05192022 0.15803448 0.16819974 0.13601217 0.06304312 
##         29         30         31         32         33         34         35 
## 0.18083375 0.16694097 0.16035696 0.07707504 0.16197947 0.16114961 0.11699467 
##         36         37         38         39         40         41         42 
## 0.15853531 0.06362119 0.08849877 0.24360833 0.32887598 0.09246225 0.33568260 
##         43         44         45         46         47         48         49 
## 0.28239961 0.20506457 0.39659426 0.15585249 0.28654103 0.21085848 0.16772259
paste0("Any h_ii  > ",2*mean(diagonal)," ?  ",
       any(diagonal > 2*mean(diagonal)))
## [1] "Any h_ii  > 0.326530612244898 ?  TRUE"

Cook’s distance

table3 <- data.frame(new_Data$`Ladder score`,cooks.distance(irls),pf(cooks.distance(irls),8,length(new_Data$`Ladder score`)-8))
colnames(table3) = c("Y", "Cooks distance", "F percentile")
head(table3)
##       Y Cooks distance F percentile
## 1 7.842    0.123299724 2.049038e-03
## 2 7.620    0.042368352 3.861277e-05
## 3 7.571    0.000690006 3.182157e-12
## 4 7.554    0.004977283 8.475499e-09
## 5 7.464    0.002967094 1.078587e-09
## 6 7.392    0.022287390 3.190128e-06
index=1:length(new_Data$`Ladder score`)
plot(index,cooks.distance(irls), type="l", col="red", lwd=3, xlab="Case index", ylab="Cook's distance", main="Index plot")

paste0("Any F percentile  > ",0.5," ?  ",
       any(pf(cooks.distance(irls),8,length(new_Data$`Ladder score`)-8) > 0.5))
## [1] "Any F percentile  > 0.5 ?  FALSE"

DFFITs

table4 <- data.frame(new_Data$`Ladder score`,dffits(irls))
colnames(table4) = c("Y", "DFFIT")
head(table4)
##       Y      DFFIT
## 1 7.842  1.0040213
## 2 7.620  0.5898165
## 3 7.571 -0.0734311
## 4 7.554  0.1975229
## 5 7.464 -0.1525238
## 6 7.392 -0.4256883

Comment

1) Studentized deleted residuals
For the studentized deleted residuals the critical value is t = 3.5439. Comparing the absolute values of the studentized deleted residuals with this critical value flags one outlier: observation 14, with |studentized deleted residual| = 5.58297177 > 3.5439. So \(Y_{14}\) is an outlier.
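The same flagging logic can be sketched on simulated data (not the project data), with one outlier planted deliberately at case 14; the sample size and number of coefficients match the model above, and `studres` comes from the MASS package already loaded:

```r
library(MASS)  # for studres()

# Simulated data with the same n and number of betas as the model above
set.seed(1)
n <- 49; p <- 8
X <- matrix(rnorm(n * (p - 1)), n, p - 1)
y <- drop(X %*% rep(1, p - 1)) + rnorm(n)
y[14] <- y[14] + 10                          # plant an outlier at case 14

fit <- lm(y ~ X)
t.del <- studres(fit)                        # studentized deleted residuals
t.crit <- qt(1 - 0.05 / (2 * n), n - p - 1)  # Bonferroni critical value
which(abs(t.del) > t.crit)                   # case 14 should be flagged
```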

2) Leverage
By the rule of thumb, any leverage \(h_{ii}\) > 2\(\bar{h}\) is flagged as having “high” leverage. In my project 2\(\bar{h}\) = 0.32653, and 3 points are flagged: observation 1 (\(h_{1,1}\) = 0.34660147), observation 40 (\(h_{40,40}\) = 0.32887598), and observation 42 (\(h_{42,42}\) = 0.33568260). Since these exceed 2\(\bar{h}\), predictions near these cases may be extrapolating outside the general region of my data.
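The 2\(\bar{h}\) rule can be sketched on simulated predictors (not the project data), with one case pushed away from the predictor cloud; with 7 predictors plus an intercept and n = 49, the cutoff 2(p/n) = 0.3265 matches the value above:

```r
# Leverage and the 2*h-bar rule of thumb on simulated predictors
set.seed(2)
X <- matrix(rnorm(49 * 7), 49, 7)
X[1, ] <- X[1, ] + 4          # push case 1 far from the predictor cloud

h <- hat(X)                   # hat-matrix diagonal (intercept added by default)
2 * mean(h)                   # the cutoff, 2 * (p / n) = 2 * 8/49
which(h > 2 * mean(h))        # case 1 should be flagged as high leverage
```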

3) Cook’s distance
To judge whether a Cook’s distance is high, we compute its percentile in the F(p, n-p) distribution, where n is the sample size and p is the number of \(\beta\)’s, and compare it with 0.5: if the percentile exceeds 0.5, the Cook’s distance is high.
In the index plot, one observation, observation 19, has a high Cook’s distance (0.4001318) relative to the others. However, I do not consider it influential, because its F percentile = 0.08600397 < 0.5.
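The decision rule can be sketched in a few lines on simulated, well-behaved data (not the project data); `f.pct` is my own name for the F percentiles, and with no influential cases none of them should come close to 0.5:

```r
# Cook's distance judged through its percentile in F(p, n - p)
set.seed(3)
n <- 49; p <- 8
X <- matrix(rnorm(n * (p - 1)), n, p - 1)
y <- drop(X %*% rep(1, p - 1)) + rnorm(n)

fit <- lm(y ~ X)
D <- cooks.distance(fit)
f.pct <- pf(D, p, n - p)      # percentile of each D_i in F(p, n - p)
any(f.pct > 0.5)              # TRUE would signal an influential case
```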

4) DFFITs
In Table 4 we can see that observation 14 (DFFIT = -1.638233500) and observation 19 (DFFIT = 1.957046176) have very large absolute DFFIT values relative to the other observations, whose |DFFIT| lie in the range [0, 1]. Since |DFFIT| > 1 for these two observations, they are outliers.
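A minimal sketch of the |DFFITS| > 1 screen, again on simulated data (not the project data) with one influential case planted deliberately:

```r
# Flagging cases with |DFFITS| > 1 on simulated data
set.seed(4)
n <- 49
x <- rnorm(n)
y <- 2 * x + rnorm(n)
y[14] <- y[14] + 10           # plant an influential case at 14

fit <- lm(y ~ x)
which(abs(dffits(fit)) > 1)   # case 14 should be flagged
```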

5.3) VIF and Added variable plots

VIF

library(car)

vif(irls)
##      Gdp_new   socsup_new  lifeexp_new    freed_new  generos_new  corrupt_new 
##     3.204327     3.237903     3.283171     1.931430     1.866073     1.939538 
## dystopia_new 
##     1.522190

Comment

The VIFs of the predictors all lie in the range [1.5, 3.3], so every VIF < 10. This means that no predictor is involved in severe multicollinearity.
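The definition behind these numbers, \(VIF_j = 1/(1 - R_j^2)\), can be sketched by hand on simulated predictors (not the project data), where \(R_j^2\) comes from regressing \(x_j\) on the other predictors:

```r
# VIF_j by hand: regress x_j on the other predictors and use 1 / (1 - R^2_j)
set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)

r2 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif.x1 <- 1 / (1 - r2)          # should agree with car::vif() on the full model
vif.x1                          # well above 1 because x1 and x2 are correlated
```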

Added variable plots

#Added var plot for x1 (GDP)
fit11 = lm(hapscore_new ~ socsup_new+lifeexp_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
reYonwithoutx1 = fit11$residuals

fit12 = lm(Gdp_new ~ socsup_new+lifeexp_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
rex1onwithoutx1 = fit12$residuals

scatter.smooth(x=rex1onwithoutx1, y=reYonwithoutx1, main="Graph 12: Added-Variable Plot for x1 (GDP)", xlab="e(x1|others)", ylab="e(Y|others without x1)")

#Added var plot for x2 (social support)
fit21 = lm(hapscore_new ~ Gdp_new+lifeexp_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
reYonwithoutx2 = fit21$residuals

fit22 = lm(socsup_new ~ Gdp_new+lifeexp_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
rex2onwithoutx2 = fit22$residuals

scatter.smooth(x=rex2onwithoutx2, y=reYonwithoutx2, main="Graph 13: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2)")

#Added var plot for x3 (life expectancy)
fit31 = lm(hapscore_new ~ Gdp_new+socsup_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
reYonwithoutx3 = fit31$residuals

fit32 = lm(lifeexp_new ~ Gdp_new+socsup_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
rex3onwithoutx3 = fit32$residuals

scatter.smooth(x=rex3onwithoutx3, y=reYonwithoutx3, main="Graph 14: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3)")

#Added var plot for x4 (freedom to make life choices)
fit41 = lm(hapscore_new ~ Gdp_new+socsup_new+lifeexp_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
reYonwithoutx4 = fit41$residuals

fit42 = lm(freed_new ~ Gdp_new+socsup_new+lifeexp_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
rex4onwithoutx4 = fit42$residuals

scatter.smooth(x=rex4onwithoutx4, y=reYonwithoutx4, main="Graph 15: Added-Variable Plot for x4 (freedom to make life choices)", xlab="e(x4|others)", ylab="e(Y|others without x4)")

#Added var plot for x5 (generosity)
fit51 = lm(hapscore_new ~ Gdp_new+socsup_new+lifeexp_new+freed_new+corrupt_new+dystopia_new, data=new_Data)
reYonwithoutx5 = fit51$residuals

fit52 = lm(generos_new ~ Gdp_new+socsup_new+lifeexp_new+freed_new+corrupt_new+dystopia_new, data=new_Data)
rex5onwithoutx5 = fit52$residuals

scatter.smooth(x=rex5onwithoutx5, y=reYonwithoutx5, main="Graph 16: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5)")

#Added var plot for x6 (perceptions of corruption)
fit61 = lm(hapscore_new ~ Gdp_new+socsup_new+lifeexp_new+freed_new+generos_new+dystopia_new, data=new_Data)
reYonwithoutx6 = fit61$residuals

fit62 = lm(corrupt_new ~ Gdp_new+socsup_new+lifeexp_new+freed_new+generos_new+dystopia_new, data=new_Data)
rex6onwithoutx6 = fit62$residuals

scatter.smooth(x=rex6onwithoutx6, y=reYonwithoutx6, main="Graph 17: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6)")

#Added var plot for x7 (ladder score in Dystopia)
fit71 = lm(hapscore_new ~ Gdp_new+socsup_new+lifeexp_new+freed_new+generos_new+corrupt_new, data=new_Data)
reYonwithoutx7 = fit71$residuals

fit72 = lm(dystopia_new ~ Gdp_new+socsup_new+lifeexp_new+freed_new+generos_new+corrupt_new, data=new_Data)
rex7onwithoutx7 = fit72$residuals

scatter.smooth(x=rex7onwithoutx7, y=reYonwithoutx7, main="Graph 18: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7)")

Comment:

All 7 added-variable plots show a strong linear relationship between residuals(\(Y\)|\(x_{-j}\)) and residuals(\(x_{j}\)|\(x_{-j}\)). So every \(x_j\) explains residual variability once the rest of the predictors are in the model, and I need to keep all 7 predictors. Also, since the relationship is linear, no transformation of any \(x_j\) is needed.
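A property worth noting here: the slope of each added-variable plot equals the corresponding coefficient in the full multiple regression (the Frisch-Waugh-Lovell result). This can be checked on simulated data (not the project data):

```r
# The added-variable (partial regression) slope equals the full-model coefficient
set.seed(6)
n <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2 * x1 + 3 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)
e.y  <- resid(lm(y ~ x2))        # e(Y | x2)
e.x1 <- resid(lm(x1 ~ x2))       # e(x1 | x2)
av.slope <- coef(lm(e.y ~ e.x1))[["e.x1"]]

c(av.slope, coef(full)[["x1"]])  # the two slopes match
```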

Conclusion

I regressed Y (happiness score) on my 8 predictors (region, GDP, corruption, freedom to make life choices, social support, healthy life expectancy, generosity, ladder score in Dystopia). After conducting the appropriate tests and plotting the appropriate graphs, I concluded that the categorical variable, Regional indicator, should be dropped: by the F-test it is not necessary, and it has little influence on the fitted values.
Then I performed model selection by applying stepwise regression and examining model selection criteria such as adjusted \(R^2\), \(AIC\), and \(C_p\). I obtained a new model based on 8 predictors (GDP, corruption, freedom to make life choices, social support, healthy life expectancy, generosity, ladder score in Dystopia, and \(freedom\times corruption\)).
After applying the methods studied in the lectures, I found that this model satisfies the homogeneity of variance and independence assumptions. However, it violates the normality of the residuals and the linearity between Y and \(x_1,...,x_8\), suffers from severe multicollinearity, and has 3 outliers, of which observation 37 was the most influential. By calculating VIFs and constructing added-variable plots, I concluded that the interaction term \(freedom\times corruption\) was involved in severe multicollinearity and did not explain any residual variability.
As a remedial measure, I dropped the interaction term \(freedom\times corruption\) from the model and removed observation 37.
I then ran the diagnostics again. The new model violates only the normality-of-residuals assumption and has 4 outliers (observations 1, 14, 40, and 42). However, these outliers do not seriously harm the regression model: the adjusted \(R^2=1\), which suggests the model will predict data not used to build it very well, and the MSE is essentially 0. Also, by the F-test and the t-tests the presence of each predictor is required, and the added-variable plots showed that each predictor explains residual variability in the presence of the other predictors.
I can also ignore the non-normality of the residuals because the sample size (n = 49) is large enough: according to the Central Limit Theorem, the sampling distribution of the estimates converges toward a normal distribution as n increases, provided the residuals are independent and identically distributed and their variance is homogeneous.
So my remedial measures were effective, and I successfully found a model that satisfies all the basic assumptions of MLR.

Evaluation

I would like to mention some suggestions that can improve my analysis:

1. Use larger samples; try to find more observations.
The larger the sample, the more precise the results and the more accurate the regression model. Also, if the sample size is large enough, the non-normality of the residuals can be ignored.

2. Use other predictor variables instead of social support, freedom to make life choices, generosity, and corruption perception.
All 4 of these predictor variables are national averages of binary (either 0 or 1) responses to GWP questions answered “Yes” or “No”, collected by surveying residents. However, residents’ opinions can be biased. For example, residents who have never traveled to other countries and have not seen the quality of life there may believe that the quality of life in their own country is bad. So I would replace these variables with approved, internationally recognized economic indexes: for example, instead of the corruption perception measure from the World Happiness Report 2021, I would use the Corruption Perceptions Index (CPI), and generosity could be replaced by the annual amount of donations made by the residents of a particular country.
Also, the underlying survey responses are nominal (binary). If I were one of the authors of the World Happiness Report 2021, I would make them ordinal: instead of GWP questions with only 2 answers (“Yes” or “No”), I would ask, for example, “On a scale from 1 to 10, please assess your freedom to make life choices, where 1 means you cannot make life decisions and 10 means you have absolute freedom to make life decisions yourself.”

3. Add new predictor variables to the model.
The response variable Y in my project was the happiness score (ladder score), which is the national average response to a life-evaluation question: the experts ask residents to assess their own life in their country on a scale from 0 (the worst possible life) to 10 (the best possible life). I would add new predictors such as the HDI, literacy rate, unemployment rate, poverty rate, and Gini index, as all of these economic indexes relate directly to the assessment of quality of life in a country; for example, the Gini index measures income inequality. Adding these predictors could improve the fit of the model and its predictive performance.

References

1. The data for the analysis was taken from World Happiness Report 2021: https://worldhappiness.report/ed/2021/.
2. The analysis is based on the lecture slides of professor Zhenisbek Assylbekov and on Chapter 6, Chapter 7, Chapter 8, Chapter 9, Chapter 10 and Chapter 11 of Kutner’s “Applied Linear Statistical Models” textbook.